2/2/23
Q: Do you recommend reviewing some statistics before attending lectures?
A: This is very much up to you. It certainly wouldn’t hurt! There will be readings connected to each lecture; these are a great place to start. It can be helpful to work through these readings and the exercises either before or after lecture!
Q: I am still not quite sure how to do HW parts 2 and 3. Do we download the data into the folder, read the file, and plot it?
A: Yup! That’s one approach. The other approach (if the data are not available for download) is to create the dataset on your own: estimate the values as best you can from the plot, create a dataframe/tibble containing that information directly in your code (similar to what you did in HW01), and then plot from there.
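For example, a minimal sketch of that second approach. The column names and values below are made up for illustration, not taken from the actual homework:

```r
library(tidyverse)

# Hypothetical values read off a plot by eye -- both the columns and the
# numbers here are invented for illustration.
plot_data <- tibble(
  year  = c(2000, 2005, 2010, 2015, 2020),
  value = c(12, 18, 25, 31, 40)
)

# Plot the hand-entered data
ggplot(plot_data, aes(x = year, y = value)) +
  geom_point() +
  geom_line()
```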
Due Dates:
We fit models with the tidymodels package (a modeling counterpart to the tidyverse package), using formula syntax:

\[\widehat{height}_{i} = \beta_0 + \beta_1 \times width_{i}\]
```
parsnip model object

Call:
stats::lm(formula = Height_in ~ Width_in, data = data)

Coefficients:
(Intercept)     Width_in
     3.6214       0.7808
```
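A sketch of the tidymodels call that produces output of this shape. The data frame `pp` below is simulated stand-in data, not the actual paintings dataset:

```r
library(tidymodels)

# Simulated stand-in for the paintings data; in lecture, the real data
# frame holds the actual Height_in and Width_in columns.
set.seed(1)
pp <- tibble(
  Width_in  = runif(100, 10, 50),
  Height_in = 3.6 + 0.78 * Width_in + rnorm(100, sd = 2)
)

# Specify a linear regression with the lm engine, then fit with a formula
ht_wt_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(Height_in ~ Width_in, data = pp)

ht_wt_fit  # printing shows the underlying stats::lm call and coefficients
```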
\[\widehat{height}_{i} = 3.6214 + 0.7808 \times width_{i}\]
```
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    3.62    0.254        14.3 8.82e-45
2 Width_in       0.781   0.00950      82.1 0
```
\[\widehat{height}_{i} = 3.62 + 0.781 \times width_{i}\]
Remember this when interpreting model coefficients
\[\hat{y}_{i} = \beta_0 + \beta_1~x_{i}\]
\[\hat{y}_{i} = b_0 + b_1~x_{i}\]
\[\bar{y} = b_0 + b_1 \bar{x} ~ \rightarrow ~ b_0 = \bar{y} - b_1 \bar{x}\]
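A quick numeric check of this identity on simulated data (the least-squares line always passes through the point of means, so b0 = ybar - b1 * xbar):

```r
# Simulate data with a known linear trend, then fit by least squares
set.seed(42)
x <- runif(50)
y <- 2 + 3 * x + rnorm(50, sd = 0.5)

fit <- lm(y ~ x)
b0 <- unname(coef(fit)[1])
b1 <- unname(coef(fit)[2])

# The intercept equals ybar - b1 * xbar (up to floating-point tolerance)
all.equal(b0, mean(y) - b1 * mean(x))  # TRUE
```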
```
# A tibble: 3,393 × 3
   name      Height_in landsALL
   <chr>         <dbl>    <dbl>
 1 L1764-2          37        0
 2 L1764-3          18        0
 3 L1764-4          13        1
 4 L1764-5a         14        1
 5 L1764-5b         14        1
 6 L1764-6           7        0
 7 L1764-7a          6        0
 8 L1764-7b          6        0
 9 L1764-8          15        0
10 L1764-9a          9        0
11 L1764-9b          9        0
12 L1764-10a        16        1
13 L1764-10b        16        1
14 L1764-10c        16        1
15 L1764-11         20        0
16 L1764-12a        14        1
17 L1764-12b        14        1
18 L1764-13a        15        1
19 L1764-13b        15        1
20 L1764-14         37        0
# ℹ 3,373 more rows
```
- landsALL = 0: No landscape features
- landsALL = 1: Some landscape features

\[\widehat{Height_{in}} = 22.7 - 5.645~landsALL\]
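Plugging each level of the indicator into the fitted line (coefficients taken from the model above):

```r
# Coefficients from the fitted indicator-variable model
b0 <- 22.7
b1 <- -5.645

# Predicted mean height at each level of landsALL
b0 + b1 * 0  # landsALL = 0 (no landscape features): 22.7 inches
b0 + b1 * 1  # landsALL = 1 (some landscape features): 17.055 inches
```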
The slope is the predicted change in mean height when going from the baseline level (landsALL = 0) to the other level (landsALL = 1).

```
# A tibble: 3,393 × 3
   name      Height_in school_pntg
   <chr>         <dbl> <chr>
 1 L1764-2          37 F
 2 L1764-3          18 I
 3 L1764-4          13 D/FL
 4 L1764-5a         14 F
 5 L1764-5b         14 F
 6 L1764-6           7 I
 7 L1764-7a          6 F
 8 L1764-7b          6 F
 9 L1764-8          15 I
10 L1764-9a          9 D/FL
11 L1764-9b          9 D/FL
12 L1764-10a        16 X
13 L1764-10b        16 X
14 L1764-10c        16 X
15 L1764-11         20 D/FL
16 L1764-12a        14 D/FL
17 L1764-12b        14 D/FL
18 L1764-13a        15 D/FL
19 L1764-13b        15 D/FL
20 L1764-14         37 F
# ℹ 3,373 more rows
```
```
# A tibble: 7 × 5
  term            estimate std.error statistic p.value
  <chr>              <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)        14.0      10.0      1.40  0.162
2 school_pntgD/FL     2.33     10.0      0.232 0.816
3 school_pntgF       10.2      10.0      1.02  0.309
4 school_pntgG        1.65     11.9      0.139 0.889
5 school_pntgI       10.3      10.0      1.02  0.306
6 school_pntgS       30.4      11.4      2.68  0.00744
7 school_pntgX        2.87     10.3      0.279 0.780
```
| school_pntg | D_FL | F | G | I | S | X |
|---|---|---|---|---|---|---|
| A | 0 | 0 | 0 | 0 | 0 | 0 |
| D/FL | 1 | 0 | 0 | 0 | 0 | 0 |
| F | 0 | 1 | 0 | 0 | 0 | 0 |
| G | 0 | 0 | 1 | 0 | 0 | 0 |
| I | 0 | 0 | 0 | 1 | 0 | 0 |
| S | 0 | 0 | 0 | 0 | 1 | 0 |
| X | 0 | 0 | 0 | 0 | 0 | 1 |
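R builds these indicator columns automatically when a categorical predictor is a factor; the baseline level gets no column of its own. A small sketch using the level labels from the table above:

```r
# One row per level of the categorical variable; the baseline ("A", the
# first level alphabetically) is absorbed into the intercept.
school <- factor(c("A", "D/FL", "F", "G", "I", "S", "X"))

# model.matrix shows the intercept plus one 0/1 indicator per other level
model.matrix(~ school)
```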
\[ \hat{y} = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k \]
\[ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k \]
```
# A tibble: 7 × 5
  term            estimate std.error statistic p.value
  <chr>              <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)        14.0      10.0      1.40  0.162
2 school_pntgD/FL     2.33     10.0      0.232 0.816
3 school_pntgF       10.2      10.0      1.02  0.309
4 school_pntgG        1.65     11.9      0.139 0.889
5 school_pntgI       10.3      10.0      1.02  0.306
6 school_pntgS       30.4      11.4      2.68  0.00744
7 school_pntgX        2.87     10.3      0.279 0.780
```
❓ On average, how tall are paintings that are 60 inches wide? \[\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}\]
❓ On average, how tall are paintings that are 400 inches wide? \[\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}\]
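Evaluating the fitted line at both widths (coefficients from the model above):

```r
# Prediction function built from the fitted coefficients
height_hat <- function(width) 3.62 + 0.78 * width

height_hat(60)   # 50.42 inches -- within the range of the observed data
height_hat(400)  # 315.62 inches -- far outside it (extrapolation!)
```

The second prediction is an extrapolation: no painting in the data is anywhere near 400 inches wide, so the model has no support for that estimate.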
“When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February 6th it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on.”1
Stephen Colbert, April 6th, 2010
The strength of the fit of a linear model is most commonly evaluated using \(R^2\).
It tells us what percent of variability in the response variable is explained by the model.
The remainder of the variability is explained by variables not included in the model.
\(R^2\) is sometimes called the coefficient of determination.
```
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
1     0.683         0.683  8.30     6749.       0     1 -11083. 22173. 22191.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```
[1] 0.6829468
```
Roughly 68% of the variability in heights of paintings can be explained by their widths.
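A sketch of extracting \(R^2\) with broom's `glance()`; the data below are simulated stand-ins for the paintings:

```r
library(broom)

# Simulated data with a linear trend plus noise
set.seed(7)
x <- runif(200, 10, 50)
y <- 3.6 + 0.78 * x + rnorm(200, sd = 8)
fit <- lm(y ~ x)

# glance() returns one row of model-level summaries, including r.squared
glance(fit)$r.squared   # same value as summary(fit)$r.squared
```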
❓ Describe the relationship between price and width of paintings whose width is less than 100in.
❓ Which plot shows a more linear relationship?
❓ Which plot shows residuals that are uncorrelated with the predicted values from the model?
❓ What’s the unit of residuals?
price has a right-skewed distribution, and the relationship between price and width of a painting is non-linear, so we model log(price) instead. The log function in R is the natural log: log(x, base = exp(1)).

❓ How do we interpret the slope of this model?
\[ \widehat{log(price)} = 4.67 + 0.02 Width \]
The slope coefficient for the log transformed model is 0.02, meaning the log price difference between paintings whose widths are one inch apart is predicted to be 0.02 log livres.
\[ log(\text{price for width x+1}) - log(\text{price for width x}) = 0.02 \]
\[ log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right) = 0.02 \]
\[ e^{log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right)} = e^{0.02} \]
\[ \frac{\text{price for width x+1}}{\text{price for width x}} \approx 1.02 \]
For each additional inch the painting is wider, the price of the painting is expected to be higher, on average, by a factor of 1.02.
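The back-transformation in one line:

```r
# Undo the log: a slope of 0.02 on log(price) multiplies price by
# exp(0.02) for each additional inch of width.
exp(0.02)              # about 1.0202
(exp(0.02) - 1) * 100  # about a 2% higher price per extra inch
```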
```
# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)    4.67
2 Width_in       0.019
```
In some cases the value of the response variable might be 0, and log(0) is undefined; a common workaround is to add a small constant before taking the log, e.g., log(price + 1).

❓ How would you fit this model with the tidymodels approach?